{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Euiq_EOntgQm"
      },
      "source": [
        "#  Introduction to Bioinformatics\n",
        "\n",
        "So far in this tutorial, we've primarily worked on the problems of cheminformatics. We've been interested in seeing how we can use the techniques of machine learning to make predictions about the properties of molecules. In this tutorial, we're going to shift a bit and see how we can use classical computer science techniques and machine learning to tackle problems in bioinformatics.\n",
        "\n",
        "For this, we're going to use the venerable [biopython](https://biopython.org/) library to do some basic bioinformatics. A lot of the material in this notebook is adapted from the extensive official [Biopython tutorial]http://biopython.org/DIST/docs/tutorial/Tutorial.html). We strongly recommend checking out the official tutorial after you work through this notebook!\n",
        "\n",
        "## Colab\n",
        "\n",
        "This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Introduction_to_Bioinformatics.ipynb)\n",
        "\n",
        "## Setup\n",
        "\n",
        "To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "background_save": true,
          "base_uri": "https://localhost:8080/"
        },
        "id": "g21hWuDwGAsC",
        "outputId": "b0aeddd2-19c0-4e76-93b2-66adaeb26e54"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
            "                                 Dload  Upload   Total   Spent    Left  Speed\n",
            "100  3457  100  3457    0     0  11516      0 --:--:-- --:--:-- --:--:-- 11523\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "add /root/miniconda/lib/python3.10/site-packages to PYTHONPATH\n",
            "INFO:conda_installer:add /root/miniconda/lib/python3.10/site-packages to PYTHONPATH\n",
            "python version: 3.10.12\n",
            "INFO:conda_installer:python version: 3.10.12\n",
            "fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh\n",
            "INFO:conda_installer:fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh\n",
            "done\n",
            "INFO:conda_installer:done\n",
            "installing miniconda to /root/miniconda\n",
            "INFO:conda_installer:installing miniconda to /root/miniconda\n",
            "done\n",
            "INFO:conda_installer:done\n",
            "installing openmm, pdbfixer\n",
            "INFO:conda_installer:installing openmm, pdbfixer\n",
            "added conda-forge to channels\n",
            "INFO:conda_installer:added conda-forge to channels\n",
            "done\n",
            "INFO:conda_installer:done\n",
            "conda packages installation finished!\n",
            "INFO:conda_installer:conda packages installation finished!\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "# conda environments:\n",
            "#\n",
            "base                     /root/miniconda\n",
            "\n"
          ]
        }
      ],
      "source": [
        "!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py\n",
        "import conda_installer\n",
        "conda_installer.install()\n",
        "!/root/miniconda/bin/conda info -e"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 502
        },
        "id": "9k2qhejltgQo",
        "outputId": "de10ba97-2478-4011-a038-8d7f4f150ad2"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Collecting deepchem\n",
            "  Downloading deepchem-2.7.2.dev20230730200710-py3-none-any.whl (827 kB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m827.4/827.4 kB\u001b[0m \u001b[31m7.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.3.1)\n",
            "Requirement already satisfied: numpy>=1.21 in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.22.4)\n",
            "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.5.3)\n",
            "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.2.2)\n",
            "Requirement already satisfied: scipy>=1.10.1 in /usr/local/lib/python3.10/dist-packages (from deepchem) (1.10.1)\n",
            "Collecting rdkit (from deepchem)\n",
            "  Downloading rdkit-2023.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.7 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m29.7/29.7 MB\u001b[0m \u001b[31m46.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->deepchem) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->deepchem) (2022.7.1)\n",
            "Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from rdkit->deepchem) (9.4.0)\n",
            "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->deepchem) (3.2.0)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->deepchem) (1.16.0)\n",
            "Installing collected packages: rdkit, deepchem\n",
            "Successfully installed deepchem-2.7.2.dev20230730200710 rdkit-2023.3.2\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:deepchem.feat.molecule_featurizers.rdkit_descriptors:No normalization for AvgIpc. Feature removed!\n",
            "WARNING:deepchem.models.torch_models:Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'\n",
            "WARNING:deepchem.models.torch_models:Skipped loading modules with transformers dependency. No module named 'transformers'\n",
            "WARNING:deepchem.models:cannot import name 'HuggingFaceModel' from 'deepchem.models.torch_models' (/usr/local/lib/python3.10/dist-packages/deepchem/models/torch_models/__init__.py)\n",
            "WARNING:deepchem.models:Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/usr/local/lib/python3.10/dist-packages/deepchem/models/torch_models/__init__.py)\n",
            "WARNING:deepchem.models:Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightning'\n",
            "WARNING:deepchem.models:Skipped loading some Jax models, missing a dependency. No module named 'haiku'\n"
          ]
        },
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'2.7.2.dev'"
            ]
          },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "!pip install --pre deepchem\n",
        "import deepchem\n",
        "deepchem.__version__"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "APnQxtIKtgQs"
      },
      "source": [
        "We'll use `pip` to install `biopython`"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "HeYSJWSAtgQt",
        "outputId": "b5ccba5a-045d-47e0-c5d0-c4db009d148a"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Collecting biopython\n",
            "  Downloading biopython-1.81-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.1/3.1 MB\u001b[0m \u001b[31m12.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from biopython) (1.22.4)\n",
            "Installing collected packages: biopython\n",
            "Successfully installed biopython-1.81\n"
          ]
        }
      ],
      "source": [
        "!pip install biopython"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "4CxSQrxptgQx",
        "outputId": "9589cb89-7d00-48c5-b65f-553af7aa514b"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'1.81'"
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import Bio\n",
        "Bio.__version__"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7eXZ-43CtgQ6",
        "outputId": "083649dc-a45d-4def-d58d-5f6f617f8cc1"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('AGTACACATTG')"
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from Bio.Seq import Seq\n",
        "my_seq = Seq(\"AGTACACATTG\")\n",
        "my_seq"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7d8xjbMJxg8D"
      },
      "source": [
        "The complement() method in Biopython's Seq object returns the complement of a DNA sequence. It replaces each base with its complement according to the Watson-Crick base pairing rules. Adenine (A) is complemented by thymine (T), and guanine (G) is complemented by cytosine (C)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hy1zvhrAxycG"
      },
      "source": [
        "The reverse_complement() method in Biopython's Seq object returns the reverse complement of a DNA sequence. It first reverses the sequence and then replaces each base with its complement according to the Watson-Crick base pairing rules. \n",
        "\n",
        "But why is direction important? Many cellular processes occur only along a particular direction. To understand what gives a sense of directionality to a strand of DNA, take a look at the pictures below. Carbon atoms in the backbone of DNA are numbered from 1' to 5' (usually pronounced as \"5 prime\") in a clockwise direction. One might notice that the strand on the left has the 5' carbon above the 3' carbon in every nucleotide, resulting in a strand starting with a 5' end and ending with a 3' end. The strand on the right runs from the 3' end to the 5' end. As hinted earlier, reading of a DNA strand during replication and transcription only occurs from the 3' end to the 5' end.\n",
        "<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/f/f2/Nukleotid_num.svg/440px-Nukleotid_num.svg.png\">\n",
        "<img src=\"https://o.quizlet.com/sLgxc-XlkDjlYalnz2Oa-Q_b.jpg\">"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Fd-wViuTtgRB",
        "outputId": "c7157fcf-a21a-475f-a06d-a09d13800fb1"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('TCATGTGTAAC')"
            ]
          },
          "execution_count": 6,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "my_seq.complement()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "GlO-43FNtgRF",
        "outputId": "2e9391eb-daa5-44a1-f808-cf49a42828a8"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('CAATGTGTACT')"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "my_seq.reverse_complement()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W-LumeWptgRJ"
      },
      "source": [
        "## Parsing Sequence Records\n",
        "\n",
        "We're going to download a sample `fasta` file from the Biopython tutorial to use in some exercises. This file is a set of hits for a sequence (of lady slipper orcid genes).It is basically the DNA sequences of a subfamily of plants belonging to Orchids(a diverse group of flowering plants) encoding the ribosomal RNA(rRNA)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "U0A0B3-FtgRK",
        "outputId": "33eb3dba-beca-4d19-9848-4c79657a6512"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "--2023-08-02 14:46:07--  https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta\n",
            "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n",
            "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 76480 (75K) [text/plain]\n",
            "Saving to: ‘ls_orchid.fasta’\n",
            "\n",
            "\rls_orchid.fasta       0%[                    ]       0  --.-KB/s               \rls_orchid.fasta     100%[===================>]  74.69K  --.-KB/s    in 0.01s   \n",
            "\n",
            "2023-08-02 14:46:07 (5.24 MB/s) - ‘ls_orchid.fasta’ saved [76480/76480]\n",
            "\n"
          ]
        }
      ],
      "source": [
        "!wget https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iNwHJES1tgRP"
      },
      "source": [
        "Let's take a look at what the contents of this file look like:\n",
        "\n",
        "1.   List item\n",
        "2.   List item\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "5ZudMHxttgRQ",
        "outputId": "d59e549b-5e6a-435e-f07b-4b7d5ba61d22"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "gi|2765658|emb|Z78533.1|CIZ78533\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')\n",
            "740\n",
            "gi|2765657|emb|Z78532.1|CCZ78532\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')\n",
            "753\n",
            "gi|2765656|emb|Z78531.1|CFZ78531\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')\n",
            "748\n",
            "gi|2765655|emb|Z78530.1|CMZ78530\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT')\n",
            "744\n",
            "gi|2765654|emb|Z78529.1|CLZ78529\n",
            "Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA')\n",
            "733\n",
            "gi|2765652|emb|Z78527.1|CYZ78527\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC')\n",
            "718\n",
            "gi|2765651|emb|Z78526.1|CGZ78526\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT')\n",
            "730\n",
            "gi|2765650|emb|Z78525.1|CAZ78525\n",
            "Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA')\n",
            "704\n",
            "gi|2765649|emb|Z78524.1|CFZ78524\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC')\n",
            "740\n",
            "gi|2765648|emb|Z78523.1|CHZ78523\n",
            "Seq('CGTAACCAGGTTTCCGTAGGTGAACCTGCGGCAGGATCATTGTTGAGACAGCAG...AAG')\n",
            "709\n",
            "gi|2765647|emb|Z78522.1|CMZ78522\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...GAG')\n",
            "700\n",
            "gi|2765646|emb|Z78521.1|CCZ78521\n",
            "Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAGAATATATGATCGAGT...ACC')\n",
            "726\n",
            "gi|2765645|emb|Z78520.1|CSZ78520\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TTT')\n",
            "753\n",
            "gi|2765644|emb|Z78519.1|CPZ78519\n",
            "Seq('ATATGATCGAGTGAATCTGGTGGACTTGTGGTTACTCAGCTCGCCATAGGCTTT...TTA')\n",
            "699\n",
            "gi|2765643|emb|Z78518.1|CRZ78518\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGGAGGATCATTGTTGAGATAGTAG...TCC')\n",
            "658\n",
            "gi|2765642|emb|Z78517.1|CFZ78517\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...AGC')\n",
            "752\n",
            "gi|2765641|emb|Z78516.1|CPZ78516\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAT...TAA')\n",
            "726\n",
            "gi|2765640|emb|Z78515.1|MXZ78515\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGCTGAGACCGTAG...AGC')\n",
            "765\n",
            "gi|2765639|emb|Z78514.1|PSZ78514\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAAGCCCCCA...CTA')\n",
            "755\n",
            "gi|2765638|emb|Z78513.1|PBZ78513\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...GAG')\n",
            "742\n",
            "gi|2765637|emb|Z78512.1|PWZ78512\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAAGCCCCCA...AGC')\n",
            "762\n",
            "gi|2765636|emb|Z78511.1|PEZ78511\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTTCGGAAGGATCATTGTTGAGACCCCCA...GGA')\n",
            "745\n",
            "gi|2765635|emb|Z78510.1|PCZ78510\n",
            "Seq('CTAACCAGGGTTCCGAGGTGACCTTCGGGAGGATTCCTTTTTAAGCCCCCGAAA...TTA')\n",
            "750\n",
            "gi|2765634|emb|Z78509.1|PPZ78509\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...GGA')\n",
            "731\n",
            "gi|2765633|emb|Z78508.1|PLZ78508\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...TGA')\n",
            "741\n",
            "gi|2765632|emb|Z78507.1|PLZ78507\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCCCCA...TGA')\n",
            "740\n",
            "gi|2765631|emb|Z78506.1|PLZ78506\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCAA...TGA')\n",
            "727\n",
            "gi|2765630|emb|Z78505.1|PSZ78505\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...TTT')\n",
            "711\n",
            "gi|2765629|emb|Z78504.1|PKZ78504\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTTCGGAAGGATCATTGTTGAGACCGCAA...TAA')\n",
            "743\n",
            "gi|2765628|emb|Z78503.1|PCZ78503\n",
            "Seq('CGTAACCAGGTTTCCGTAGGTGAACCTCCGGAAGGATCCTTGTTGAGACCGCCA...TAA')\n",
            "727\n",
            "gi|2765627|emb|Z78502.1|PBZ78502\n",
            "Seq('CGTAACCAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGACCGCCA...CGC')\n",
            "757\n",
            "gi|2765626|emb|Z78501.1|PCZ78501\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCAA...AGA')\n",
            "770\n",
            "gi|2765625|emb|Z78500.1|PWZ78500\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGCTCATTGTTGAGACCGCAA...AAG')\n",
            "767\n",
            "gi|2765624|emb|Z78499.1|PMZ78499\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAGGGATCATTGTTGAGATCGCAT...ACC')\n",
            "759\n",
            "gi|2765623|emb|Z78498.1|PMZ78498\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAAGGTCATTGTTGAGATCACAT...AGC')\n",
            "750\n",
            "gi|2765622|emb|Z78497.1|PDZ78497\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC')\n",
            "788\n",
            "gi|2765621|emb|Z78496.1|PAZ78496\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...AGC')\n",
            "774\n",
            "gi|2765620|emb|Z78495.1|PEZ78495\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...GTG')\n",
            "789\n",
            "gi|2765619|emb|Z78494.1|PNZ78494\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGGTCGCAT...AAG')\n",
            "688\n",
            "gi|2765618|emb|Z78493.1|PGZ78493\n",
            "Seq('CGTAACAAGGATTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...CCC')\n",
            "719\n",
            "gi|2765617|emb|Z78492.1|PBZ78492\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...ATA')\n",
            "743\n",
            "gi|2765616|emb|Z78491.1|PCZ78491\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...AGC')\n",
            "737\n",
            "gi|2765615|emb|Z78490.1|PFZ78490\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA')\n",
            "728\n",
            "gi|2765614|emb|Z78489.1|PDZ78489\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGC')\n",
            "740\n",
            "gi|2765613|emb|Z78488.1|PTZ78488\n",
            "Seq('CTGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACGCAATAATTGATCGA...GCT')\n",
            "696\n",
            "gi|2765612|emb|Z78487.1|PHZ78487\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TAA')\n",
            "732\n",
            "gi|2765611|emb|Z78486.1|PBZ78486\n",
            "Seq('CGTCACGAGGTTTCCGTAGGTGAATCTGCGGGAGGATCATTGTTGAGATCACAT...TGA')\n",
            "731\n",
            "gi|2765610|emb|Z78485.1|PHZ78485\n",
            "Seq('CTGAACCTGGTGTCCGAAGGTGAATCTGCGGATGGATCATTGTTGAGATATCAT...GTA')\n",
            "735\n",
            "gi|2765609|emb|Z78484.1|PCZ78484\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGGGGAAGGATCATTGTTGAGATCACAT...TTT')\n",
            "720\n",
            "gi|2765608|emb|Z78483.1|PVZ78483\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA')\n",
            "740\n",
            "gi|2765607|emb|Z78482.1|PEZ78482\n",
            "Seq('TCTACTGCAGTGACCGAGATTTGCCATCGAGCCTCCTGGGAGCTTTCTTGCTGG...GCA')\n",
            "629\n",
            "gi|2765606|emb|Z78481.1|PIZ78481\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA')\n",
            "572\n",
            "gi|2765605|emb|Z78480.1|PGZ78480\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA')\n",
            "587\n",
            "gi|2765604|emb|Z78479.1|PPZ78479\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGT')\n",
            "700\n",
            "gi|2765603|emb|Z78478.1|PVZ78478\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCAGTGTTGAGATCACAT...GGC')\n",
            "636\n",
            "gi|2765602|emb|Z78477.1|PVZ78477\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC')\n",
            "716\n",
            "gi|2765601|emb|Z78476.1|PGZ78476\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...CCC')\n",
            "592\n",
            "gi|2765600|emb|Z78475.1|PSZ78475\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT')\n",
            "716\n",
            "gi|2765599|emb|Z78474.1|PKZ78474\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACGT...CTT')\n",
            "733\n",
            "gi|2765598|emb|Z78473.1|PSZ78473\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG')\n",
            "626\n",
            "gi|2765597|emb|Z78472.1|PLZ78472\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC')\n",
            "737\n",
            "gi|2765596|emb|Z78471.1|PDZ78471\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC')\n",
            "740\n",
            "gi|2765595|emb|Z78470.1|PPZ78470\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GTT')\n",
            "574\n",
            "gi|2765594|emb|Z78469.1|PHZ78469\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GTT')\n",
            "594\n",
            "gi|2765593|emb|Z78468.1|PAZ78468\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...GTT')\n",
            "610\n",
            "gi|2765592|emb|Z78467.1|PSZ78467\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA')\n",
            "730\n",
            "gi|2765591|emb|Z78466.1|PPZ78466\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...CCC')\n",
            "641\n",
            "gi|2765590|emb|Z78465.1|PRZ78465\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC')\n",
            "702\n",
            "gi|2765589|emb|Z78464.1|PGZ78464\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAGCGGAAGGGTCATTGTTGAGATCACATAATA...AGC')\n",
            "733\n",
            "gi|2765588|emb|Z78463.1|PGZ78463\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGTTCATTGTTGAGATCACAT...AGC')\n",
            "738\n",
            "gi|2765587|emb|Z78462.1|PSZ78462\n",
            "Seq('CGTCACGAGGTCTCCGGATGTGACCCTGCGGAAGGATCATTGTTGAGATCACAT...CAT')\n",
            "736\n",
            "gi|2765586|emb|Z78461.1|PWZ78461\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...TAA')\n",
            "732\n",
            "gi|2765585|emb|Z78460.1|PCZ78460\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...TTA')\n",
            "745\n",
            "gi|2765584|emb|Z78459.1|PDZ78459\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TTT')\n",
            "744\n",
            "gi|2765583|emb|Z78458.1|PHZ78458\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TTG')\n",
            "738\n",
            "gi|2765582|emb|Z78457.1|PCZ78457\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...GAG')\n",
            "739\n",
            "gi|2765581|emb|Z78456.1|PTZ78456\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC')\n",
            "740\n",
            "gi|2765580|emb|Z78455.1|PJZ78455\n",
            "Seq('CGTAACCAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAGATCACAT...GCA')\n",
            "745\n",
            "gi|2765579|emb|Z78454.1|PFZ78454\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AAC')\n",
            "695\n",
            "gi|2765578|emb|Z78453.1|PSZ78453\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA')\n",
            "745\n",
            "gi|2765577|emb|Z78452.1|PBZ78452\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA')\n",
            "743\n",
            "gi|2765576|emb|Z78451.1|PHZ78451\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGTACCTCCGGAAGGATCATTGTTGAGATCACAT...AGC')\n",
            "730\n",
            "gi|2765575|emb|Z78450.1|PPZ78450\n",
            "Seq('GGAAGGATCATTGCTGATATCACATAATAATTGATCGAGTTAAGCTGGAGGATC...GAG')\n",
            "706\n",
            "gi|2765574|emb|Z78449.1|PMZ78449\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC')\n",
            "744\n",
            "gi|2765573|emb|Z78448.1|PAZ78448\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG')\n",
            "742\n",
            "gi|2765572|emb|Z78447.1|PVZ78447\n",
            "Seq('CGTAACAAGGATTCCGTAGGTGAACCTGCGGGAGGATCATTGTTGAGATCACAT...AGC')\n",
            "694\n",
            "gi|2765571|emb|Z78446.1|PAZ78446\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...CCC')\n",
            "712\n",
            "gi|2765570|emb|Z78445.1|PUZ78445\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGT')\n",
            "715\n",
            "gi|2765569|emb|Z78444.1|PAZ78444\n",
            "Seq('CGTAACAAGGTTTCCGTAGGGTGAACTGCGGAAGGATCATTGTTGAGATCACAT...ATT')\n",
            "688\n",
            "gi|2765568|emb|Z78443.1|PLZ78443\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG')\n",
            "784\n",
            "gi|2765567|emb|Z78442.1|PBZ78442\n",
            "Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACATAATAATTGATCGAGT...AGT')\n",
            "721\n",
            "gi|2765566|emb|Z78441.1|PSZ78441\n",
            "Seq('GGAAGGTCATTGCCGATATCACATAATAATTGATCGAGTTAATCTGGAGGATCT...GAG')\n",
            "703\n",
            "gi|2765565|emb|Z78440.1|PPZ78440\n",
            "Seq('CGTAACAAGGTTTCCGTAGGTGGACCTCCGGGAGGATCATTGTTGAGATCACAT...GCA')\n",
            "744\n",
            "gi|2765564|emb|Z78439.1|PBZ78439\n",
            "Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC')\n",
            "592\n"
          ]
        }
      ],
      "source": [
        "from Bio import SeqIO\n",
        "for seq_record in SeqIO.parse('ls_orchid.fasta', 'fasta'):\n",
        "    print(seq_record.id)\n",
        "    print(repr(seq_record.seq))\n",
        "    print(len(seq_record))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UV-0Mvv-tgRV"
      },
      "source": [
        "## Sequence Objects\n",
        "\n",
        "A large part of the biopython infrastructure deals with tools for handling sequences. These could be DNA sequences, RNA sequences, amino acid sequences or even more exotic constructs. Generally, 'Seq' object can be treated like a normal python string. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "kdkqKHmgtgRW",
        "outputId": "bcadfe09-e44f-4619-ae70-0daaec836176"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "ACAGTAGAC\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "Seq('ACAGTAGAC')"
            ]
          },
          "execution_count": 10,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from Bio.Seq import Seq\n",
        "my_seq = Seq(\"ACAGTAGAC\")\n",
        "print(my_seq)\n",
        "my_seq"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dUw07rNrtgRr"
      },
      "source": [
        "If we want to code a protein sequence, we can do that just as easily."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "O6WUnJEftgRs",
        "outputId": "de1246a5-607c-4240-c0ff-aa756b41de52"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('AAAAA')"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "my_prot = Seq(\"AAAAA\")\n",
        "my_prot"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pTPKw7cHtgR3"
      },
      "source": [
        "We can take the length of sequences and index into them like strings."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "OkY6Tx60tgR4",
        "outputId": "58f8773c-5ec2-4863-dee0-cae1780a2884"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "5\n"
          ]
        }
      ],
      "source": [
        "print(len(my_prot))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "YSOUpm8FtgR8",
        "outputId": "15a613f8-8af9-48bf-dcdf-b460584acb68"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'A'"
            ]
          },
          "execution_count": 13,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "my_prot[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PdQ8weemtgR_"
      },
      "source": [
        "You can also use slice notation on sequences to get subsequences."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "U5v3swWFtgSA",
        "outputId": "2c20cc2b-1dbd-45a5-aa89-2a59499fe78a"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('AAA')"
            ]
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "my_prot[0:3]"
      ]
    },
    {
      "cell_type": "raw",
      "metadata": {
        "id": "Ng_LjVQ6tgSI"
      },
      "source": [
        "You can concatenate sequences if they have the same type so this works."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ZG77QUj2tgSJ",
        "outputId": "5a6d75f9-5a54-44a3-b91b-ccc529aa525a"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('AAAAAAAAAA')"
            ]
          },
          "execution_count": 15,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "my_prot + my_prot"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "m2H_DKmNfqNI"
      },
      "source": [
        " Biopython automatically handles the concatenation as both sequences are of the generic alphabet."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "MZ53Yjr1tgSO",
        "outputId": "cafdc17f-7c02-4d38-ba6f-0bfbeaf370f6"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('AAAAAACAGTAGAC')"
            ]
          },
          "execution_count": 16,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "my_prot + my_seq"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_Z-KdC2WtgSR"
      },
      "source": [
        "## Transcription\n",
        "\n",
        "Transcription is the process by which a DNA sequence is converted into messenger RNA. Remember that this is part of the \"central dogma\" of biology in which DNA engenders messenger RNA which engenders proteins. Here's a nice representation of this cycle borrowed from a Khan academy [lesson](https://cdn.kastatic.org/ka-perseus-images/20ce29384b2e7ff0cdea72acaa5b1dbd7287ab00.png).\n",
        "\n",
        "<img src=\"https://cdn.kastatic.org/ka-perseus-images/20ce29384b2e7ff0cdea72acaa5b1dbd7287ab00.png\">"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1ZjlCDDmtgSU"
      },
      "source": [
        "Note from the image above that DNA has two strands. The top strand is typically called the coding strand, and the bottom the template strand. The template strand is used for the actual transcription process of conversion into messenger RNA, but in bioinformatics, it's more common to work with the coding strand because this strand has the same sequence as the RNA transcript (except that RNA has uracil (U) instead of thymine (T)). Let's now see how we can execute a transcription computationally using Biopython.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "TvPiRx_0tgSU",
        "outputId": "c439854d-402e-47b1-9e4c-bee2e641f4dd"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "ATGATCTCGTAA\n"
          ]
        }
      ],
      "source": [
        "from Bio.Seq import Seq\n",
        "coding_dna = Seq(\"ATGATCTCGTAA\")\n",
        "print(coding_dna)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "arGizrBztgSX",
        "outputId": "67f3cbd4-f095-4d8a-d493-53ff9b924a9d"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('TTACGAGATCAT')"
            ]
          },
          "execution_count": 18,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "template_dna = coding_dna.reverse_complement()\n",
        "template_dna"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "x8FjupA9tgSa"
      },
      "source": [
        "Note that these sequences match those in the image below. You might be confused about why the `template_dna` sequence is shown reversed. The reason is that by convention, the template strand is read in the reverse direction.\n",
        "\n",
        "Let's now see how we can transcribe our `coding_dna` strand into messenger RNA. This will only swap 'T' for 'U' and change the alphabet for our object."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "oo8bBugUtgSa",
        "outputId": "dffba7c3-d346-4c8e-cc62-c378a884a404"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('AUGAUCUCGUAA')"
            ]
          },
          "execution_count": 19,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "messenger_rna = coding_dna.transcribe()\n",
        "messenger_rna"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6UTMKVfAtgSe"
      },
      "source": [
        "We can also perform a \"back-transcription\" to recover the original coding strand from the messenger RNA."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "edClUMputgSf",
        "outputId": "5d806ae5-e427-4bdd-e057-a58a3e99df3c"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('ATGATCTCGTAA')"
            ]
          },
          "execution_count": 20,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "messenger_rna.back_transcribe()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0GqZSdFptgSk"
      },
      "source": [
        "## Translation\n",
        "\n",
        "Translation is the next step in the process, whereby a messenger RNA is transformed into a protein sequence. Here's a beautiful diagram [from Wikipedia](https://en.wikipedia.org/wiki/Translation_(biology)#/media/File:Ribosome_mRNA_translation_en.svg) that lays out the basics of this process.\n",
        "\n",
        "<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/b/b1/Ribosome_mRNA_translation_en.svg/1000px-Ribosome_mRNA_translation_en.svg.png\">\n",
        "\n",
        "Note how 3 nucleotides at a time correspond to one new amino acid added to the growing protein chain. A set of 3 nucleotides which codes for a given amino acid is called a \"codon.\" We can use the `translate()` method on the messenger rna to perform this transformation in code."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7K_pm48HtgSm"
      },
      "source": [
        "messenger_rna.translate()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IiUYHWmRtgSm"
      },
      "source": [
        "The translation can also be performed directly from the coding sequence DNA"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "cy8y6y9CtgSn",
        "outputId": "33b2c1b2-554c-4d2f-8ee4-e93b52c48aee"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('MIS*')"
            ]
          },
          "execution_count": 21,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "coding_dna.translate()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6hHsJnfQtgSq"
      },
      "source": [
        "Let's now consider a longer genetic sequence that has some more interesting structure for us to look at."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "iwpB4lYatgSs",
        "outputId": "e9657957-2909-49ed-8966-a619a927f208"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('MAIVMGR*KGAR*')"
            ]
          },
          "execution_count": 22,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "coding_dna = Seq(\"ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG\")\n",
        "coding_dna.translate()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ulq6Gc06tgSv"
      },
      "source": [
        "In both of the sequences above, '*' represents the [stop codon](https://en.wikipedia.org/wiki/Stop_codon). A stop codon is a sequence of 3 nucleotides that turns off the protein machinery. In DNA, the stop codons are 'TGA', 'TAA', 'TAG'. Note that this latest sequence has multiple stop codons. It's possible to run the machinery up to the first stop codon and pause too."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 23,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6uScm61FtgSw",
        "outputId": "9acf5eb7-ebb6-4017-899d-d3115e69af1e"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('MAIVMGR')"
            ]
          },
          "execution_count": 23,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "coding_dna.translate(to_stop=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aVLT-471tgS2"
      },
      "source": [
        "We're going to introduce a bit of terminology here. A complete coding sequence CDS is a nucleotide sequence of messenger RNA which is made of a whole number of codons (that is, the length of the sequence is a multiple of 3), starts with a \"start codon\" and ends with a \"stop codon\". A start codon is basically the opposite of a stop codon and is mostly commonly the sequence \"AUG\", but can be different (especially if you're dealing with something like bacterial DNA).\n",
        "\n",
        "Let's see how we can translate a complete CDS of bacterial messenger RNA."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "iy9-Co_WtgS3",
        "outputId": "905fd871-6738-4dd2-c326-45e958254757"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR*\n"
          ]
        }
      ],
      "source": [
        "from Bio.Seq import Seq\n",
        "gene = Seq(\n",
        "    \"GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA\"\n",
        "    \"GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT\"\n",
        "    \"AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT\"\n",
        "    \"TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT\"\n",
        "    \"AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA\"\n",
        ")\n",
        "protein_sequence = gene.translate(table=\"Bacterial\")\n",
        "print(protein_sequence)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "yWmqHt3GtgS6",
        "outputId": "4052fae2-1b30-4493-f15e-7228d2179dc4"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR')"
            ]
          },
          "execution_count": 25,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "gene.translate(table=\"Bacterial\", to_stop=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hqQlZA2dtgS8"
      },
      "source": [
        "# Handling Annotated Sequences\n",
        "\n",
        "Sometimes it will be useful for us to be able to handle annotated sequences where there's richer annotations, as in GenBank or EMBL files. For these purposes, we'll want to use the `SeqRecord` class."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 26,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "nnHQ_fObtgS9",
        "outputId": "dd914e4a-5501-4ba8-abfa-5f6bc38244d6"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Help on class SeqRecord in module Bio.SeqRecord:\n",
            "\n",
            "class SeqRecord(builtins.object)\n",
            " |  SeqRecord(seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=None, features=None, annotations=None, letter_annotations=None)\n",
            " |  \n",
            " |  A SeqRecord object holds a sequence and information about it.\n",
            " |  \n",
            " |  Main attributes:\n",
            " |   - id          - Identifier such as a locus tag (string)\n",
            " |   - seq         - The sequence itself (Seq object or similar)\n",
            " |  \n",
            " |  Additional attributes:\n",
            " |   - name        - Sequence name, e.g. gene name (string)\n",
            " |   - description - Additional text (string)\n",
            " |   - dbxrefs     - List of database cross references (list of strings)\n",
            " |   - features    - Any (sub)features defined (list of SeqFeature objects)\n",
            " |   - annotations - Further information about the whole sequence (dictionary).\n",
            " |     Most entries are strings, or lists of strings.\n",
            " |   - letter_annotations - Per letter/symbol annotation (restricted\n",
            " |     dictionary). This holds Python sequences (lists, strings\n",
            " |     or tuples) whose length matches that of the sequence.\n",
            " |     A typical use would be to hold a list of integers\n",
            " |     representing sequencing quality scores, or a string\n",
            " |     representing the secondary structure.\n",
            " |  \n",
            " |  You will typically use Bio.SeqIO to read in sequences from files as\n",
            " |  SeqRecord objects.  However, you may want to create your own SeqRecord\n",
            " |  objects directly (see the __init__ method for further details):\n",
            " |  \n",
            " |  >>> from Bio.Seq import Seq\n",
            " |  >>> from Bio.SeqRecord import SeqRecord\n",
            " |  >>> record = SeqRecord(Seq(\"MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\"),\n",
            " |  ...                    id=\"YP_025292.1\", name=\"HokC\",\n",
            " |  ...                    description=\"toxic membrane protein\")\n",
            " |  >>> print(record)\n",
            " |  ID: YP_025292.1\n",
            " |  Name: HokC\n",
            " |  Description: toxic membrane protein\n",
            " |  Number of features: 0\n",
            " |  Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')\n",
            " |  \n",
            " |  If you want to save SeqRecord objects to a sequence file, use Bio.SeqIO\n",
            " |  for this.  For the special case where you want the SeqRecord turned into\n",
            " |  a string in a particular file format there is a format method which uses\n",
            " |  Bio.SeqIO internally:\n",
            " |  \n",
            " |  >>> print(record.format(\"fasta\"))\n",
            " |  >YP_025292.1 toxic membrane protein\n",
            " |  MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n",
            " |  <BLANKLINE>\n",
            " |  \n",
            " |  You can also do things like slicing a SeqRecord, checking its length, etc\n",
            " |  \n",
            " |  >>> len(record)\n",
            " |  44\n",
            " |  >>> edited = record[:10] + record[11:]\n",
            " |  >>> print(edited.seq)\n",
            " |  MKQHKAMIVAIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n",
            " |  >>> print(record.seq)\n",
            " |  MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n",
            " |  \n",
            " |  Methods defined here:\n",
            " |  \n",
            " |  __add__(self, other)\n",
            " |      Add another sequence or string to this sequence.\n",
            " |      \n",
            " |      The other sequence can be a SeqRecord object, a Seq object (or\n",
            " |      similar, e.g. a MutableSeq) or a plain Python string. If you add\n",
            " |      a plain string or a Seq (like) object, the new SeqRecord will simply\n",
            " |      have this appended to the existing data. However, any per letter\n",
            " |      annotation will be lost:\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> record = SeqIO.read(\"Quality/solexa_faked.fastq\", \"fastq-solexa\")\n",
            " |      >>> print(\"%s %s\" % (record.id, record.seq))\n",
            " |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN\n",
            " |      >>> print(list(record.letter_annotations))\n",
            " |      ['solexa_quality']\n",
            " |      \n",
            " |      >>> new = record + \"ACT\"\n",
            " |      >>> print(\"%s %s\" % (new.id, new.seq))\n",
            " |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNNACT\n",
            " |      >>> print(list(new.letter_annotations))\n",
            " |      []\n",
            " |      \n",
            " |      The new record will attempt to combine the annotation, but for any\n",
            " |      ambiguities (e.g. different names) it defaults to omitting that\n",
            " |      annotation.\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> with open(\"GenBank/pBAD30.gb\") as handle:\n",
            " |      ...     plasmid = SeqIO.read(handle, \"gb\")\n",
            " |      >>> print(\"%s %i\" % (plasmid.id, len(plasmid)))\n",
            " |      pBAD30 4923\n",
            " |      \n",
            " |      Now let's cut the plasmid into two pieces, and join them back up the\n",
            " |      other way round (i.e. shift the starting point on this plasmid, have\n",
            " |      a look at the annotated features in the original file to see why this\n",
            " |      particular split point might make sense):\n",
            " |      \n",
            " |      >>> left = plasmid[:3765]\n",
            " |      >>> right = plasmid[3765:]\n",
            " |      >>> new = right + left\n",
            " |      >>> print(\"%s %i\" % (new.id, len(new)))\n",
            " |      pBAD30 4923\n",
            " |      >>> str(new.seq) == str(right.seq + left.seq)\n",
            " |      True\n",
            " |      >>> len(new.features) == len(left.features) + len(right.features)\n",
            " |      True\n",
            " |      \n",
            " |      When we add the left and right SeqRecord objects, their annotation\n",
            " |      is all consistent, so it is all conserved in the new SeqRecord:\n",
            " |      \n",
            " |      >>> new.id == left.id == right.id == plasmid.id\n",
            " |      True\n",
            " |      >>> new.name == left.name == right.name == plasmid.name\n",
            " |      True\n",
            " |      >>> new.description == plasmid.description\n",
            " |      True\n",
            " |      >>> new.annotations == left.annotations == right.annotations\n",
            " |      True\n",
            " |      >>> new.letter_annotations == plasmid.letter_annotations\n",
            " |      True\n",
            " |      >>> new.dbxrefs == left.dbxrefs == right.dbxrefs\n",
            " |      True\n",
            " |      \n",
            " |      However, we should point out that when we sliced the SeqRecord,\n",
            " |      any annotations dictionary or dbxrefs list entries were lost.\n",
            " |      You can explicitly copy them like this:\n",
            " |      \n",
            " |      >>> new.annotations = plasmid.annotations.copy()\n",
            " |      >>> new.dbxrefs = plasmid.dbxrefs[:]\n",
            " |  \n",
            " |  __bool__(self)\n",
            " |      Boolean value of an instance of this class (True).\n",
            " |      \n",
            " |      This behaviour is for backwards compatibility, since until the\n",
            " |      __len__ method was added, a SeqRecord always evaluated as True.\n",
            " |      \n",
            " |      Note that in comparison, a Seq object will evaluate to False if it\n",
            " |      has a zero length sequence.\n",
            " |      \n",
            " |      WARNING: The SeqRecord may in future evaluate to False when its\n",
            " |      sequence is of zero length (in order to better match the Seq\n",
            " |      object behaviour)!\n",
            " |  \n",
            " |  __bytes__(self)\n",
            " |  \n",
            " |  __contains__(self, char)\n",
            " |      Implement the 'in' keyword, searches the sequence.\n",
            " |      \n",
            " |      e.g.\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> record = SeqIO.read(\"Fasta/sweetpea.nu\", \"fasta\")\n",
            " |      >>> \"GAATTC\" in record\n",
            " |      False\n",
            " |      >>> \"AAA\" in record\n",
            " |      True\n",
            " |      \n",
            " |      This essentially acts as a proxy for using \"in\" on the sequence:\n",
            " |      \n",
            " |      >>> \"GAATTC\" in record.seq\n",
            " |      False\n",
            " |      >>> \"AAA\" in record.seq\n",
            " |      True\n",
            " |      \n",
            " |      Note that you can also use Seq objects as the query,\n",
            " |      \n",
            " |      >>> from Bio.Seq import Seq\n",
            " |      >>> Seq(\"AAA\") in record\n",
            " |      True\n",
            " |      \n",
            " |      See also the Seq object's __contains__ method.\n",
            " |  \n",
            " |  __eq__(self, other)\n",
            " |      Define the equal-to operand (not implemented).\n",
            " |  \n",
            " |  __format__(self, format_spec)\n",
            " |      Return the record as a string in the specified file format.\n",
            " |      \n",
            " |      This method supports the Python format() function and f-strings.\n",
            " |      The format_spec should be a lower case string supported by\n",
            " |      Bio.SeqIO as a text output file format. Requesting a binary file\n",
            " |      format raises a ValueError. e.g.\n",
            " |      \n",
            " |      >>> from Bio.Seq import Seq\n",
            " |      >>> from Bio.SeqRecord import SeqRecord\n",
            " |      >>> record = SeqRecord(Seq(\"MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\"),\n",
            " |      ...                    id=\"YP_025292.1\", name=\"HokC\",\n",
            " |      ...                    description=\"toxic membrane protein\")\n",
            " |      ...\n",
            " |      >>> format(record, \"fasta\")\n",
            " |      '>YP_025292.1 toxic membrane protein\\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\\n'\n",
            " |      >>> print(f\"Here is {record.id} in FASTA format:\\n{record:fasta}\")\n",
            " |      Here is YP_025292.1 in FASTA format:\n",
            " |      >YP_025292.1 toxic membrane protein\n",
            " |      MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n",
            " |      <BLANKLINE>\n",
            " |      \n",
            " |      See also the SeqRecord's format() method.\n",
            " |  \n",
            " |  __ge__(self, other)\n",
            " |      Define the greater-than-or-equal-to operand (not implemented).\n",
            " |  \n",
            " |  __getitem__(self, index)\n",
            " |      Return a sub-sequence or an individual letter.\n",
            " |      \n",
            " |      Slicing, e.g. my_record[5:10], returns a new SeqRecord for\n",
            " |      that sub-sequence with some annotation preserved as follows:\n",
            " |      \n",
            " |      * The name, id and description are kept as-is.\n",
            " |      * Any per-letter-annotations are sliced to match the requested\n",
            " |        sub-sequence.\n",
            " |      * Unless a stride is used, all those features which fall fully\n",
            " |        within the subsequence are included (with their locations\n",
            " |        adjusted accordingly). If you want to preserve any truncated\n",
            " |        features (e.g. GenBank/EMBL source features), you must\n",
            " |        explicitly add them to the new SeqRecord yourself.\n",
            " |      * With the exception of any molecule type, the annotations\n",
            " |        dictionary and the dbxrefs list are not used for the new\n",
            " |        SeqRecord, as in general they may not apply to the\n",
            " |        subsequence. If you want to preserve them, you must explicitly\n",
            " |        copy them to the new SeqRecord yourself.\n",
            " |      \n",
            " |      Using an integer index, e.g. my_record[5] is shorthand for\n",
            " |      extracting that letter from the sequence, my_record.seq[5].\n",
            " |      \n",
            " |      For example, consider this short protein and its secondary\n",
            " |      structure as encoded by the PDB (e.g. H for alpha helices),\n",
            " |      plus a simple feature for its histidine self phosphorylation\n",
            " |      site:\n",
            " |      \n",
            " |      >>> from Bio.Seq import Seq\n",
            " |      >>> from Bio.SeqRecord import SeqRecord\n",
            " |      >>> from Bio.SeqFeature import SeqFeature, SimpleLocation\n",
            " |      >>> rec = SeqRecord(Seq(\"MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLAT\"\n",
            " |      ...                     \"EMMSEQDGYLAESINKDIEECNAIIEQFIDYLR\"),\n",
            " |      ...                 id=\"1JOY\", name=\"EnvZ\",\n",
            " |      ...                 description=\"Homodimeric domain of EnvZ from E. coli\")\n",
            " |      >>> rec.letter_annotations[\"secondary_structure\"] = \"  S  SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT  \"\n",
            " |      >>> rec.features.append(SeqFeature(SimpleLocation(20, 21),\n",
            " |      ...                     type = \"Site\"))\n",
            " |      \n",
            " |      Now let's have a quick look at the full record,\n",
            " |      \n",
            " |      >>> print(rec)\n",
            " |      ID: 1JOY\n",
            " |      Name: EnvZ\n",
            " |      Description: Homodimeric domain of EnvZ from E. coli\n",
            " |      Number of features: 1\n",
            " |      Per letter annotation for: secondary_structure\n",
            " |      Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR')\n",
            " |      >>> rec.letter_annotations[\"secondary_structure\"]\n",
            " |      '  S  SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT  '\n",
            " |      >>> print(rec.features[0].location)\n",
            " |      [20:21]\n",
            " |      \n",
            " |      Now let's take a sub sequence, here chosen as the first (fractured)\n",
            " |      alpha helix which includes the histidine phosphorylation site:\n",
            " |      \n",
            " |      >>> sub = rec[11:41]\n",
            " |      >>> print(sub)\n",
            " |      ID: 1JOY\n",
            " |      Name: EnvZ\n",
            " |      Description: Homodimeric domain of EnvZ from E. coli\n",
            " |      Number of features: 1\n",
            " |      Per letter annotation for: secondary_structure\n",
            " |      Seq('RTLLMAGVSHDLRTPLTRIRLATEMMSEQD')\n",
            " |      >>> sub.letter_annotations[\"secondary_structure\"]\n",
            " |      'HHHHHTTTHHHHHHHHHHHHHHHHHHHHHH'\n",
            " |      >>> print(sub.features[0].location)\n",
            " |      [9:10]\n",
            " |      \n",
            " |      You can also of course omit the start or end values, for\n",
            " |      example to get the first ten letters only:\n",
            " |      \n",
            " |      >>> print(rec[:10])\n",
            " |      ID: 1JOY\n",
            " |      Name: EnvZ\n",
            " |      Description: Homodimeric domain of EnvZ from E. coli\n",
            " |      Number of features: 0\n",
            " |      Per letter annotation for: secondary_structure\n",
            " |      Seq('MAAGVKQLAD')\n",
            " |      \n",
            " |      Or for the last ten letters:\n",
            " |      \n",
            " |      >>> print(rec[-10:])\n",
            " |      ID: 1JOY\n",
            " |      Name: EnvZ\n",
            " |      Description: Homodimeric domain of EnvZ from E. coli\n",
            " |      Number of features: 0\n",
            " |      Per letter annotation for: secondary_structure\n",
            " |      Seq('IIEQFIDYLR')\n",
            " |      \n",
            " |      If you omit both, then you get a copy of the original record (although\n",
            " |      lacking the annotations and dbxrefs):\n",
            " |      \n",
            " |      >>> print(rec[:])\n",
            " |      ID: 1JOY\n",
            " |      Name: EnvZ\n",
            " |      Description: Homodimeric domain of EnvZ from E. coli\n",
            " |      Number of features: 1\n",
            " |      Per letter annotation for: secondary_structure\n",
            " |      Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR')\n",
            " |      \n",
            " |      Finally, indexing with a simple integer is shorthand for pulling out\n",
            " |      that letter from the sequence directly:\n",
            " |      \n",
            " |      >>> rec[5]\n",
            " |      'K'\n",
            " |      >>> rec.seq[5]\n",
            " |      'K'\n",
            " |  \n",
            " |  __gt__(self, other)\n",
            " |      Define the greater-than operand (not implemented).\n",
            " |  \n",
            " |  __init__(self, seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=None, features=None, annotations=None, letter_annotations=None)\n",
            " |      Create a SeqRecord.\n",
            " |      \n",
            " |      Arguments:\n",
            " |       - seq         - Sequence, required (Seq or MutableSeq)\n",
            " |       - id          - Sequence identifier, recommended (string)\n",
            " |       - name        - Sequence name, optional (string)\n",
            " |       - description - Sequence description, optional (string)\n",
            " |       - dbxrefs     - Database cross references, optional (list of strings)\n",
            " |       - features    - Any (sub)features, optional (list of SeqFeature objects)\n",
            " |       - annotations - Dictionary of annotations for the whole sequence\n",
            " |       - letter_annotations - Dictionary of per-letter-annotations, values\n",
            " |         should be strings, list or tuples of the same length as the full\n",
            " |         sequence.\n",
            " |      \n",
            " |      You will typically use Bio.SeqIO to read in sequences from files as\n",
            " |      SeqRecord objects.  However, you may want to create your own SeqRecord\n",
            " |      objects directly.\n",
            " |      \n",
            " |      Note that while an id is optional, we strongly recommend you supply a\n",
            " |      unique id string for each record.  This is especially important\n",
            " |      if you wish to write your sequences to a file.\n",
            " |      \n",
            " |      You can create a 'blank' SeqRecord object, and then populate the\n",
            " |      attributes later.\n",
            " |  \n",
            " |  __iter__(self)\n",
            " |      Iterate over the letters in the sequence.\n",
            " |      \n",
            " |      For example, using Bio.SeqIO to read in a protein FASTA file:\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> record = SeqIO.read(\"Fasta/loveliesbleeding.pro\", \"fasta\")\n",
            " |      >>> for amino in record:\n",
            " |      ...     print(amino)\n",
            " |      ...     if amino == \"L\": break\n",
            " |      X\n",
            " |      A\n",
            " |      G\n",
            " |      L\n",
            " |      >>> print(record.seq[3])\n",
            " |      L\n",
            " |      \n",
            " |      This is just a shortcut for iterating over the sequence directly:\n",
            " |      \n",
            " |      >>> for amino in record.seq:\n",
            " |      ...     print(amino)\n",
            " |      ...     if amino == \"L\": break\n",
            " |      X\n",
            " |      A\n",
            " |      G\n",
            " |      L\n",
            " |      >>> print(record.seq[3])\n",
            " |      L\n",
            " |      \n",
            " |      Note that this does not facilitate iteration together with any\n",
            " |      per-letter-annotation.  However, you can achieve that using the\n",
            " |      python zip function on the record (or its sequence) and the relevant\n",
            " |      per-letter-annotation:\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> rec = SeqIO.read(\"Quality/solexa_faked.fastq\", \"fastq-solexa\")\n",
            " |      >>> print(\"%s %s\" % (rec.id, rec.seq))\n",
            " |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN\n",
            " |      >>> print(list(rec.letter_annotations))\n",
            " |      ['solexa_quality']\n",
            " |      >>> for nuc, qual in zip(rec, rec.letter_annotations[\"solexa_quality\"]):\n",
            " |      ...     if qual > 35:\n",
            " |      ...         print(\"%s %i\" % (nuc, qual))\n",
            " |      A 40\n",
            " |      C 39\n",
            " |      G 38\n",
            " |      T 37\n",
            " |      A 36\n",
            " |      \n",
            " |      You may agree that using zip(rec.seq, ...) is more explicit than using\n",
            " |      zip(rec, ...) as shown above.\n",
            " |  \n",
            " |  __le__(self, other)\n",
            " |      Define the less-than-or-equal-to operand (not implemented).\n",
            " |  \n",
            " |  __len__(self)\n",
            " |      Return the length of the sequence.\n",
            " |      \n",
            " |      For example, using Bio.SeqIO to read in a FASTA nucleotide file:\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> record = SeqIO.read(\"Fasta/sweetpea.nu\", \"fasta\")\n",
            " |      >>> len(record)\n",
            " |      309\n",
            " |      >>> len(record.seq)\n",
            " |      309\n",
            " |  \n",
            " |  __lt__(self, other)\n",
            " |      Define the less-than operand (not implemented).\n",
            " |  \n",
            " |  __ne__(self, other)\n",
            " |      Define the not-equal-to operand (not implemented).\n",
            " |  \n",
            " |  __radd__(self, other)\n",
            " |      Add another sequence or string to this sequence (from the left).\n",
            " |      \n",
            " |      This method handles adding a Seq object (or similar, e.g. MutableSeq)\n",
            " |      or a plain Python string (on the left) to a SeqRecord (on the right).\n",
            " |      See the __add__ method for more details, but for example:\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> record = SeqIO.read(\"Quality/solexa_faked.fastq\", \"fastq-solexa\")\n",
            " |      >>> print(\"%s %s\" % (record.id, record.seq))\n",
            " |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN\n",
            " |      >>> print(list(record.letter_annotations))\n",
            " |      ['solexa_quality']\n",
            " |      \n",
            " |      >>> new = \"ACT\" + record\n",
            " |      >>> print(\"%s %s\" % (new.id, new.seq))\n",
            " |      slxa_0001_1_0001_01 ACTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN\n",
            " |      >>> print(list(new.letter_annotations))\n",
            " |      []\n",
            " |  \n",
            " |  __repr__(self)\n",
            " |      Return a concise summary of the record for debugging (string).\n",
            " |      \n",
            " |      The python built in function repr works by calling the object's __repr__\n",
            " |      method.  e.g.\n",
            " |      \n",
            " |      >>> from Bio.Seq import Seq\n",
            " |      >>> from Bio.SeqRecord import SeqRecord\n",
            " |      >>> rec = SeqRecord(Seq(\"MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT\"\n",
            " |      ...                     \"GEMKEQTEWHRVVLFGKLAEVASEYLRKGSQVYIEGQLRTRKWTDQ\"\n",
            " |      ...                     \"SGQDRYTTEVVVNVGGTMQMLGGRQGGGAPAGGNIGGGQPQGGWGQ\"\n",
            " |      ...                     \"PQQPQGGNQFSGGAQSRPQQSAPAAPSNEPPMDFDDDIPF\"),\n",
            " |      ...                 id=\"NP_418483.1\", name=\"b4059\",\n",
            " |      ...                 description=\"ssDNA-binding protein\",\n",
            " |      ...                 dbxrefs=[\"ASAP:13298\", \"GI:16131885\", \"GeneID:948570\"])\n",
            " |      >>> print(repr(rec))\n",
            " |      SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF'), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])\n",
            " |      \n",
            " |      At the python prompt you can also use this shorthand:\n",
            " |      \n",
            " |      >>> rec\n",
            " |      SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF'), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])\n",
            " |      \n",
            " |      Note that long sequences are shown truncated. Also note that any\n",
            " |      annotations, letter_annotations and features are not shown (as they\n",
            " |      would lead to a very long string).\n",
            " |  \n",
            " |  __str__(self)\n",
            " |      Return a human readable summary of the record and its annotation (string).\n",
            " |      \n",
            " |      The python built in function str works by calling the object's __str__\n",
            " |      method.  e.g.\n",
            " |      \n",
            " |      >>> from Bio.Seq import Seq\n",
            " |      >>> from Bio.SeqRecord import SeqRecord\n",
            " |      >>> record = SeqRecord(Seq(\"MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\"),\n",
            " |      ...                    id=\"YP_025292.1\", name=\"HokC\",\n",
            " |      ...                    description=\"toxic membrane protein, small\")\n",
            " |      >>> print(str(record))\n",
            " |      ID: YP_025292.1\n",
            " |      Name: HokC\n",
            " |      Description: toxic membrane protein, small\n",
            " |      Number of features: 0\n",
            " |      Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')\n",
            " |      \n",
            " |      In this example you don't actually need to call str explicitly, as the\n",
            " |      print command does this automatically:\n",
            " |      \n",
            " |      >>> print(record)\n",
            " |      ID: YP_025292.1\n",
            " |      Name: HokC\n",
            " |      Description: toxic membrane protein, small\n",
            " |      Number of features: 0\n",
            " |      Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')\n",
            " |      \n",
            " |      Note that long sequences are shown truncated.\n",
            " |  \n",
            " |  count(self, sub, start=None, end=None)\n",
            " |      Return the number of non-overlapping occurrences of sub in seq[start:end].\n",
            " |      \n",
            " |      Optional arguments start and end are interpreted as in slice notation.\n",
            " |      This method behaves as the count method of Python strings.\n",
            " |  \n",
            " |  format(self, format)\n",
            " |      Return the record as a string in the specified file format.\n",
            " |      \n",
            " |      The format should be a lower case string supported as an output\n",
            " |      format by Bio.SeqIO, which is used to turn the SeqRecord into a\n",
            " |      string.  e.g.\n",
            " |      \n",
            " |      >>> from Bio.Seq import Seq\n",
            " |      >>> from Bio.SeqRecord import SeqRecord\n",
            " |      >>> record = SeqRecord(Seq(\"MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\"),\n",
            " |      ...                    id=\"YP_025292.1\", name=\"HokC\",\n",
            " |      ...                    description=\"toxic membrane protein\")\n",
            " |      >>> record.format(\"fasta\")\n",
            " |      '>YP_025292.1 toxic membrane protein\\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\\n'\n",
            " |      >>> print(record.format(\"fasta\"))\n",
            " |      >YP_025292.1 toxic membrane protein\n",
            " |      MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n",
            " |      <BLANKLINE>\n",
            " |      \n",
            " |      The Python print function automatically appends a new line, meaning\n",
            " |      in this example a blank line is shown.  If you look at the string\n",
            " |      representation you can see there is a trailing new line (shown as\n",
            " |      slash n) which is important when writing to a file or if\n",
            " |      concatenating multiple sequence strings together.\n",
            " |      \n",
            " |      Note that this method will NOT work on every possible file format\n",
            " |      supported by Bio.SeqIO (e.g. some are for multiple sequences only,\n",
            " |      and binary formats are not supported).\n",
            " |  \n",
            " |  islower(self)\n",
            " |      Return True if all ASCII characters in the record's sequence are lowercase.\n",
            " |      \n",
            " |      If there are no cased characters, the method returns False.\n",
            " |  \n",
            " |  isupper(self)\n",
            " |      Return True if all ASCII characters in the record's sequence are uppercase.\n",
            " |      \n",
            " |      If there are no cased characters, the method returns False.\n",
            " |  \n",
            " |  lower(self)\n",
            " |      Return a copy of the record with a lower case sequence.\n",
            " |      \n",
            " |      All the annotation is preserved unchanged. e.g.\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> record = SeqIO.read(\"Fasta/aster.pro\", \"fasta\")\n",
            " |      >>> print(record.format(\"fasta\"))\n",
            " |      >gi|3298468|dbj|BAA31520.1| SAMIPF\n",
            " |      GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVG\n",
            " |      VTNALVFEIVMTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI\n",
            " |      <BLANKLINE>\n",
            " |      >>> print(record.lower().format(\"fasta\"))\n",
            " |      >gi|3298468|dbj|BAA31520.1| SAMIPF\n",
            " |      gghvnpavtfgafvggnitllrgivyiiaqllgstvaclllkfvtndmavgvfslsagvg\n",
            " |      vtnalvfeivmtfglvytvyataidpkkgslgtiapiaigfivgani\n",
            " |      <BLANKLINE>\n",
            " |      \n",
            " |      To take a more annotation rich example,\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> old = SeqIO.read(\"EMBL/TRBG361.embl\", \"embl\")\n",
            " |      >>> len(old.features)\n",
            " |      3\n",
            " |      >>> new = old.lower()\n",
            " |      >>> len(old.features) == len(new.features)\n",
            " |      True\n",
            " |      >>> old.annotations[\"organism\"] == new.annotations[\"organism\"]\n",
            " |      True\n",
            " |      >>> old.dbxrefs == new.dbxrefs\n",
            " |      True\n",
            " |  \n",
            " |  reverse_complement(self, id=False, name=False, description=False, features=True, annotations=False, letter_annotations=True, dbxrefs=False)\n",
            " |      Return new SeqRecord with reverse complement sequence.\n",
            " |      \n",
            " |      By default the new record does NOT preserve the sequence identifier,\n",
            " |      name, description, general annotation or database cross-references -\n",
            " |      these are unlikely to apply to the reversed sequence.\n",
            " |      \n",
            " |      You can specify the returned record's id, name and description as\n",
            " |      strings, or True to keep that of the parent, or False for a default.\n",
            " |      \n",
            " |      You can specify the returned record's features with a list of\n",
            " |      SeqFeature objects, or True to keep that of the parent, or False to\n",
            " |      omit them. The default is to keep the original features (with the\n",
            " |      strand and locations adjusted).\n",
            " |      \n",
            " |      You can also specify both the returned record's annotations and\n",
            " |      letter_annotations as dictionaries, True to keep that of the parent,\n",
            " |      or False to omit them. The default is to keep the original\n",
            " |      annotations (with the letter annotations reversed).\n",
            " |      \n",
            " |      To show what happens to the pre-letter annotations, consider an\n",
            " |      example Solexa variant FASTQ file with a single entry, which we'll\n",
            " |      read in as a SeqRecord:\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> record = SeqIO.read(\"Quality/solexa_faked.fastq\", \"fastq-solexa\")\n",
            " |      >>> print(\"%s %s\" % (record.id, record.seq))\n",
            " |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN\n",
            " |      >>> print(list(record.letter_annotations))\n",
            " |      ['solexa_quality']\n",
            " |      >>> print(record.letter_annotations[\"solexa_quality\"])\n",
            " |      [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]\n",
            " |      \n",
            " |      Now take the reverse complement, here we explicitly give a new\n",
            " |      identifier (the old identifier with a suffix):\n",
            " |      \n",
            " |      >>> rc_record = record.reverse_complement(id=record.id + \"_rc\")\n",
            " |      >>> print(\"%s %s\" % (rc_record.id, rc_record.seq))\n",
            " |      slxa_0001_1_0001_01_rc NNNNNNACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT\n",
            " |      \n",
            " |      Notice that the per-letter-annotations have also been reversed,\n",
            " |      although this may not be appropriate for all cases.\n",
            " |      \n",
            " |      >>> print(rc_record.letter_annotations[\"solexa_quality\"])\n",
            " |      [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]\n",
            " |      \n",
            " |      Now for the features, we need a different example. Parsing a GenBank\n",
            " |      file is probably the easiest way to get an nice example with features\n",
            " |      in it...\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> with open(\"GenBank/pBAD30.gb\") as handle:\n",
            " |      ...     plasmid = SeqIO.read(handle, \"gb\")\n",
            " |      >>> print(\"%s %i\" % (plasmid.id, len(plasmid)))\n",
            " |      pBAD30 4923\n",
            " |      >>> plasmid.seq\n",
            " |      Seq('GCTAGCGGAGTGTATACTGGCTTACTATGTTGGCACTGATGAGGGTGTCAGTGA...ATG')\n",
            " |      >>> len(plasmid.features)\n",
            " |      13\n",
            " |      \n",
            " |      Now, let's take the reverse complement of this whole plasmid:\n",
            " |      \n",
            " |      >>> rc_plasmid = plasmid.reverse_complement(id=plasmid.id+\"_rc\")\n",
            " |      >>> print(\"%s %i\" % (rc_plasmid.id, len(rc_plasmid)))\n",
            " |      pBAD30_rc 4923\n",
            " |      >>> rc_plasmid.seq\n",
            " |      Seq('CATGGGCAAATATTATACGCAAGGCGACAAGGTGCTGATGCCGCTGGCGATTCA...AGC')\n",
            " |      >>> len(rc_plasmid.features)\n",
            " |      13\n",
            " |      \n",
            " |      Let's compare the first CDS feature - it has gone from being the\n",
            " |      second feature (index 1) to the second last feature (index -2), its\n",
            " |      strand has changed, and the location switched round.\n",
            " |      \n",
            " |      >>> print(plasmid.features[1])\n",
            " |      type: CDS\n",
            " |      location: [1081:1960](-)\n",
            " |      qualifiers:\n",
            " |          Key: label, Value: ['araC']\n",
            " |          Key: note, Value: ['araC regulator of the arabinose BAD promoter']\n",
            " |          Key: vntifkey, Value: ['4']\n",
            " |      <BLANKLINE>\n",
            " |      >>> print(rc_plasmid.features[-2])\n",
            " |      type: CDS\n",
            " |      location: [2963:3842](+)\n",
            " |      qualifiers:\n",
            " |          Key: label, Value: ['araC']\n",
            " |          Key: note, Value: ['araC regulator of the arabinose BAD promoter']\n",
            " |          Key: vntifkey, Value: ['4']\n",
            " |      <BLANKLINE>\n",
            " |      \n",
            " |      You can check this new location, based on the length of the plasmid:\n",
            " |      \n",
            " |      >>> len(plasmid) - 1081\n",
            " |      3842\n",
            " |      >>> len(plasmid) - 1960\n",
            " |      2963\n",
            " |      \n",
            " |      Note that if the SeqFeature annotation includes any strand specific\n",
            " |      information (e.g. base changes for a SNP), this information is not\n",
            " |      amended, and would need correction after the reverse complement.\n",
            " |      \n",
            " |      Note trying to reverse complement a protein SeqRecord raises an\n",
            " |      exception:\n",
            " |      \n",
            " |      >>> from Bio.Seq import Seq\n",
            " |      >>> from Bio.SeqRecord import SeqRecord\n",
            " |      >>> protein_rec = SeqRecord(Seq(\"MAIVMGR\"), id=\"Test\",\n",
            " |      ...                         annotations={\"molecule_type\": \"protein\"})\n",
            " |      >>> protein_rec.reverse_complement()\n",
            " |      Traceback (most recent call last):\n",
            " |         ...\n",
            " |      ValueError: Proteins do not have complements!\n",
            " |      \n",
            " |      If you have RNA without any U bases, it must be annotated as RNA\n",
            " |      otherwise it will be treated as DNA by default with A mapped to T:\n",
            " |      \n",
            " |      >>> from Bio.Seq import Seq\n",
            " |      >>> from Bio.SeqRecord import SeqRecord\n",
            " |      >>> rna1 = SeqRecord(Seq(\"ACG\"), id=\"Test\")\n",
            " |      >>> rna2 = SeqRecord(Seq(\"ACG\"), id=\"Test\", annotations={\"molecule_type\": \"RNA\"})\n",
            " |      >>> print(rna1.reverse_complement(id=\"RC\", description=\"unk\").format(\"fasta\"))\n",
            " |      >RC unk\n",
            " |      CGT\n",
            " |      <BLANKLINE>\n",
            " |      >>> print(rna2.reverse_complement(id=\"RC\", description=\"RNA\").format(\"fasta\"))\n",
            " |      >RC RNA\n",
            " |      CGU\n",
            " |      <BLANKLINE>\n",
            " |      \n",
            " |      Also note you can reverse complement a SeqRecord using a MutableSeq:\n",
            " |      \n",
            " |      >>> from Bio.Seq import MutableSeq\n",
            " |      >>> from Bio.SeqRecord import SeqRecord\n",
            " |      >>> rec = SeqRecord(MutableSeq(\"ACGT\"), id=\"Test\")\n",
            " |      >>> rec.seq[0] = \"T\"\n",
            " |      >>> print(\"%s %s\" % (rec.id, rec.seq))\n",
            " |      Test TCGT\n",
            " |      >>> rc = rec.reverse_complement(id=True)\n",
            " |      >>> print(\"%s %s\" % (rc.id, rc.seq))\n",
            " |      Test ACGA\n",
            " |  \n",
            " |  translate(self, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap=None, id=False, name=False, description=False, features=False, annotations=False, letter_annotations=False, dbxrefs=False)\n",
            " |      Return new SeqRecord with translated sequence.\n",
            " |      \n",
            " |      This calls the record's .seq.translate() method (which describes\n",
            " |      the translation related arguments, like table for the genetic code),\n",
            " |      \n",
            " |      By default the new record does NOT preserve the sequence identifier,\n",
            " |      name, description, general annotation or database cross-references -\n",
            " |      these are unlikely to apply to the translated sequence.\n",
            " |      \n",
            " |      You can specify the returned record's id, name and description as\n",
            " |      strings, or True to keep that of the parent, or False for a default.\n",
            " |      \n",
            " |      You can specify the returned record's features with a list of\n",
            " |      SeqFeature objects, or False (default) to omit them.\n",
            " |      \n",
            " |      You can also specify both the returned record's annotations and\n",
            " |      letter_annotations as dictionaries, True to keep that of the parent\n",
            " |      (annotations only), or False (default) to omit them.\n",
            " |      \n",
            " |      e.g. Loading a FASTA gene and translating it,\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> gene_record = SeqIO.read(\"Fasta/sweetpea.nu\", \"fasta\")\n",
            " |      >>> print(gene_record.format(\"fasta\"))\n",
            " |      >gi|3176602|gb|U78617.1|LOU78617 Lathyrus odoratus phytochrome A (PHYA) gene, partial cds\n",
            " |      CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCA\n",
            " |      AAACATGTGAAGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCG\n",
            " |      ACCTTAAGAGCTCCACATAGTTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCT\n",
            " |      TCATTGGTTATGGCAGTGGTCGTCAATGACAGCGATGAAGATGGAGATAGCCGTGACGCA\n",
            " |      GTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAGTTTGTCATAACACTACTCCG\n",
            " |      AGGTTTGTT\n",
            " |      <BLANKLINE>\n",
            " |      \n",
            " |      And now translating the record, specifying the new ID and description:\n",
            " |      \n",
            " |      >>> protein_record = gene_record.translate(table=11,\n",
            " |      ...                                        id=\"phya\",\n",
            " |      ...                                        description=\"translation\")\n",
            " |      >>> print(protein_record.format(\"fasta\"))\n",
            " |      >phya translation\n",
            " |      QAARFLFMKNKVRMIVDCHAKHVKVLQDEKLPFDLTLCGSTLRAPHSCHLQYMANMDSIA\n",
            " |      SLVMAVVVNDSDEDGDSRDAVLPQKKKRLWGLVVCHNTTPRFV\n",
            " |      <BLANKLINE>\n",
            " |  \n",
            " |  upper(self)\n",
            " |      Return a copy of the record with an upper case sequence.\n",
            " |      \n",
            " |      All the annotation is preserved unchanged. e.g.\n",
            " |      \n",
            " |      >>> from Bio.Seq import Seq\n",
            " |      >>> from Bio.SeqRecord import SeqRecord\n",
            " |      >>> record = SeqRecord(Seq(\"acgtACGT\"), id=\"Test\",\n",
            " |      ...                    description = \"Made up for this example\")\n",
            " |      >>> record.letter_annotations[\"phred_quality\"] = [1, 2, 3, 4, 5, 6, 7, 8]\n",
            " |      >>> print(record.upper().format(\"fastq\"))\n",
            " |      @Test Made up for this example\n",
            " |      ACGTACGT\n",
            " |      +\n",
            " |      \"#$%&'()\n",
            " |      <BLANKLINE>\n",
            " |      \n",
            " |      Naturally, there is a matching lower method:\n",
            " |      \n",
            " |      >>> print(record.lower().format(\"fastq\"))\n",
            " |      @Test Made up for this example\n",
            " |      acgtacgt\n",
            " |      +\n",
            " |      \"#$%&'()\n",
            " |      <BLANKLINE>\n",
            " |  \n",
            " |  ----------------------------------------------------------------------\n",
            " |  Data descriptors defined here:\n",
            " |  \n",
            " |  __dict__\n",
            " |      dictionary for instance variables (if defined)\n",
            " |  \n",
            " |  __weakref__\n",
            " |      list of weak references to the object (if defined)\n",
            " |  \n",
            " |  letter_annotations\n",
            " |      Dictionary of per-letter-annotation for the sequence.\n",
            " |      \n",
            " |      For example, this can hold quality scores used in FASTQ or QUAL files.\n",
            " |      Consider this example using Bio.SeqIO to read in an example Solexa\n",
            " |      variant FASTQ file as a SeqRecord:\n",
            " |      \n",
            " |      >>> from Bio import SeqIO\n",
            " |      >>> record = SeqIO.read(\"Quality/solexa_faked.fastq\", \"fastq-solexa\")\n",
            " |      >>> print(\"%s %s\" % (record.id, record.seq))\n",
            " |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN\n",
            " |      >>> print(list(record.letter_annotations))\n",
            " |      ['solexa_quality']\n",
            " |      >>> print(record.letter_annotations[\"solexa_quality\"])\n",
            " |      [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]\n",
            " |      \n",
            " |      The letter_annotations get sliced automatically if you slice the\n",
            " |      parent SeqRecord, for example taking the last ten bases:\n",
            " |      \n",
            " |      >>> sub_record = record[-10:]\n",
            " |      >>> print(\"%s %s\" % (sub_record.id, sub_record.seq))\n",
            " |      slxa_0001_1_0001_01 ACGTNNNNNN\n",
            " |      >>> print(sub_record.letter_annotations[\"solexa_quality\"])\n",
            " |      [4, 3, 2, 1, 0, -1, -2, -3, -4, -5]\n",
            " |      \n",
            " |      Any python sequence (i.e. list, tuple or string) can be recorded in\n",
            " |      the SeqRecord's letter_annotations dictionary as long as the length\n",
            " |      matches that of the SeqRecord's sequence.  e.g.\n",
            " |      \n",
            " |      >>> len(sub_record.letter_annotations)\n",
            " |      1\n",
            " |      >>> sub_record.letter_annotations[\"dummy\"] = \"abcdefghij\"\n",
            " |      >>> len(sub_record.letter_annotations)\n",
            " |      2\n",
            " |      \n",
            " |      You can delete entries from the letter_annotations dictionary as usual:\n",
            " |      \n",
            " |      >>> del sub_record.letter_annotations[\"solexa_quality\"]\n",
            " |      >>> sub_record.letter_annotations\n",
            " |      {'dummy': 'abcdefghij'}\n",
            " |      \n",
            " |      You can completely clear the dictionary easily as follows:\n",
            " |      \n",
            " |      >>> sub_record.letter_annotations = {}\n",
            " |      >>> sub_record.letter_annotations\n",
            " |      {}\n",
            " |      \n",
            " |      Note that if replacing the record's sequence with a sequence of a\n",
            " |      different length you must first clear the letter_annotations dict.\n",
            " |  \n",
            " |  seq\n",
            " |      The sequence itself, as a Seq or MutableSeq object.\n",
            " |  \n",
            " |  ----------------------------------------------------------------------\n",
            " |  Data and other attributes defined here:\n",
            " |  \n",
            " |  __hash__ = None\n",
            "\n"
          ]
        }
      ],
      "source": [
        "from Bio.SeqRecord import SeqRecord\n",
        "help(SeqRecord)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "20ZEptbZtgTC"
      },
      "source": [
        "Let's write a bit of code involving `SeqRecord` and see how it comes out looking."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {
        "id": "yD3E6wrYtgTC"
      },
      "outputs": [],
      "source": [
        "from Bio.SeqRecord import SeqRecord\n",
        "\n",
        "simple_seq = Seq(\"GATC\")\n",
        "simple_seq_r = SeqRecord(simple_seq)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "3FItR96PtgTG",
        "outputId": "d54ca548-d2e7-419c-c6e0-4e2d34a62548"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "AC12345\n",
            "Made up sequence\n"
          ]
        }
      ],
      "source": [
        "simple_seq_r.id = \"AC12345\"\n",
        "simple_seq_r.description = \"Made up sequence\"\n",
        "print(simple_seq_r.id)\n",
        "print(simple_seq_r.description)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cxAH3YE0tgTK"
      },
      "source": [
        "Let's now see how we can use `SeqRecord` to parse a large fasta file. We'll pull down a file hosted on the biopython site."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vNxAQJkqtgTL",
        "outputId": "fdb7d821-e8e7-4971-856f-788d3514a74b"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "--2023-08-02 14:46:08--  https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna\n",
            "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
            "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 9853 (9.6K) [text/plain]\n",
            "Saving to: ‘NC_005816.fna’\n",
            "\n",
            "NC_005816.fna       100%[===================>]   9.62K  --.-KB/s    in 0s      \n",
            "\n",
            "2023-08-02 14:46:08 (78.8 MB/s) - ‘NC_005816.fna’ saved [9853/9853]\n",
            "\n"
          ]
        }
      ],
      "source": [
        "!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 30,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "mvFt3fVqtgTP",
        "outputId": "b38fe587-ac58-4e41-ef2f-08abdceb6dd0"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG'), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])"
            ]
          },
          "execution_count": 30,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from Bio import SeqIO\n",
        "\n",
        "record = SeqIO.read(\"NC_005816.fna\", \"fasta\")\n",
        "record"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5_fCYXkttgTW"
      },
      "source": [
        "Note how there's a number of annotations attached to the `SeqRecord` object!\n",
        "\n",
        "Let's take a closer look."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 31,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "N7OdmewwtgTa",
        "outputId": "eabde65d-74fb-45df-b2b5-a2fab4601178"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'gi|45478711|ref|NC_005816.1|'"
            ]
          },
          "execution_count": 31,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "record.id"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 32,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "156aQviwtgTd",
        "outputId": "2689b07f-596b-4696-ab05-dbe750f05449"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'gi|45478711|ref|NC_005816.1|'"
            ]
          },
          "execution_count": 32,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "record.name"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 33,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "Ov2neH1XtgTk",
        "outputId": "ad098a13-875a-4cc4-9a62-7639b6766b80"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'"
            ]
          },
          "execution_count": 33,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "record.description"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HTOwKix8tgTr"
      },
      "source": [
        "Let's now look at the same sequence, but downloaded from GenBank. We'll download the hosted file from the biopython tutorial website as before."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 34,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "LpqMN5Z_tgTs",
        "outputId": "3f03fe9c-183a-4933-a599-ff275022ca00"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "--2023-08-02 14:46:08--  https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb\n",
            "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
            "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 31838 (31K) [text/plain]\n",
            "Saving to: ‘NC_005816.gb’\n",
            "\n",
            "NC_005816.gb        100%[===================>]  31.09K  --.-KB/s    in 0.002s  \n",
            "\n",
            "2023-08-02 14:46:08 (12.7 MB/s) - ‘NC_005816.gb’ saved [31838/31838]\n",
            "\n"
          ]
        }
      ],
      "source": [
        "!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 35,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "PhalU4PRtgTw",
        "outputId": "b41adfd2-6bff-4adf-d7c3-b4abe3af9102"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG'), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['Project:58037'])"
            ]
          },
          "execution_count": 35,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from Bio import SeqIO\n",
        "\n",
        "record = SeqIO.read(\"NC_005816.gb\", \"genbank\")\n",
        "record"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "inhAKrVItgT2"
      },
      "source": [
        "## SeqIO Objects\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AmHqJFIgyuCp"
      },
      "source": [
        "# .count() Method\n",
        "The .count() method in Biopython's Seq object behaves similar to the .count() method of Python strings. It returns the number of non-overlapping occurrences of a specific subsequence within the sequence."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 36,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "okON0bUHtgT6",
        "outputId": "da2fbe81-2c03-4062-a8cb-eb1074a04bb8"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "4\n",
            "1\n"
          ]
        }
      ],
      "source": [
        "from Bio.Seq import Seq\n",
        "my_seq = Seq(\"AGTACACATTG\")\n",
        "count_a = my_seq.count('A')\n",
        "count_tg = my_seq.count('TG')\n",
        "print(count_a)   # Output: 3\n",
        "print(count_tg)  # Output: 1"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YGj3bJRJ0R0L"
      },
      "source": [
        "# MutableSeq objects"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4-BFyE8Q0br0"
      },
      "source": [
        "Just like the normal Python string, the Seq object is “read only”, or in Python terminology, immutable. Apart from wanting the Seq object to act like a string, this is also a useful default since in many biological applications you want to ensure you are not changing your sequence data:\n",
        "you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything you want with it"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 37,
      "metadata": {
        "id": "J-HTJIR40VFB"
      },
      "outputs": [],
      "source": [
        " from Bio.Seq import MutableSeq\n",
        " mutable_seq = MutableSeq(\"GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "References:\n",
        "[1]https://www.khanacademy.org/science/ap-biology/gene-expression-and-regulation/transcription-and-rna-processing/a/overview-of-transcription\n",
        "[2]From DNA to RNA. https://www.ncbi.nlm.nih.gov/books/NBK26887/\n",
        "    "
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
